{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8304721a",
   "metadata": {},
   "source": [
    "## Pandas: String Operations"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "270db25f",
   "metadata": {},
   "source": [
    "Overview: working with the strings in the data you have collected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "b642757e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "03874449",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "d    0\n",
      "b    1\n",
      "c    2\n",
      "a    3\n",
      "e    4\n",
      "dtype: int64\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>B</th>\n",
       "      <th>A</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>-8</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>4</td>\n",
       "      <td>6</td>\n",
       "      <td>-5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   B  A  C\n",
       "d  4  5  8\n",
       "b -8  3  3\n",
       "c  4  6 -5\n",
       "a  1  2  6"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Sample data\n",
    "s0 = pd.Series(range(5),index=['d','b','c','a','e'])\n",
    "print(s0)\n",
    "df0 = pd.DataFrame(np.random.randint(-9,9,size=(4,3)),index=['d','b','c','a'],columns=['B','A','C'])\n",
    "df0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61252f16",
   "metadata": {},
   "source": [
    "### 1. Common string object methods (Python's built-in string methods)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "4c7e958e",
   "metadata": {},
   "outputs": [],
   "source": [
    "s = 'a, b,   c'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9893a230",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['a', ' b', '   c']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Split the string on commas\n",
    "s.split(',')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "79ab9b00",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['a', 'b', 'c']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Strip leading and trailing whitespace (including \\n) from each piece\n",
    "l_s = [x.strip() for x in s.split(',') ]\n",
    "l_s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "5ff1dfb7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('a', 'b', 'c')"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a,b,c=l_s\n",
    "a,b,c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "478f9eeb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a::b::c'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a+'::'+b+'::'+c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "2c9fca76",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a::b::c'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Join the strings in a list with a separator\n",
    "'::'.join(l_s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a7b0ce29",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Use the in keyword to test containment; here it checks whether 'c' is an element of the list l_s\n",
    "'c' in l_s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "8e8e4306",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get the index of the first occurrence of a substring. Raises ValueError if it is not found.\n",
    "s.index(',')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "cfaeeae2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-1"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Search for a substring. Returns -1 if it is not found, otherwise the index of the first occurrence.\n",
    "s.find(':')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "2f766598",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'k, b,   c'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Replace a substring. A new string is returned; the original is not modified.\n",
    "s.replace('a','k')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "0fbd9e0a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a b   c'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Replacing with an empty string removes every occurrence of the substring.\n",
    "s.replace(',','')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32d1448f",
   "metadata": {},
   "source": [
    "### 2. Regular expressions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "6980dccd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "85c8c260",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = 'foo   bar\\t  bat   \\tqq'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "6441fdff",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['foo', 'bar', 'bat', 'qq']"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
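   ],
   "source": [
    "# Call the module-level function directly with a pattern string\n",
    "# \\s matches any whitespace character (space, \\t, \\n, ...); + means one or more repetitions\n",
    "re.split(r'\\s+', text)"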
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "a24c8442",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['foo', 'bar', 'bat', 'qq']"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Use a compiled regular expression object\n",
    "# A compiled pattern can be reused across many calls\n",
    "res = re.compile(r'\\s+')\n",
    "res.split(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "53f3f4e1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['   ', '\\t  ', '   \\t']"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# findall searches the string and returns a list of all non-overlapping matches\n",
    "res.findall(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "d1f7f436",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['   ', '\\t  ', '   \\t']"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The equivalent module-level form of findall\n",
    "re.findall(res,text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "05080f21",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
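   ],
   "source": [
    "# re.match only matches at the beginning of the string (for example, to check a leading word), so 'o' does not match here and the result is None.\n",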
    "m1 = re.match('o',text)\n",
    "print(m1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "54983591",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'f'"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# group() returns the matched text; here the pattern 'f' matches at the start of the string\n",
    "re.match('f', text).group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "eaf9b0a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "#使用serch进行匹配，只能找到一个匹配项\n",
    "s1 = re.search('b',text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "68a36df6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'b'"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s1.group()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "253f5e0f",
   "metadata": {},
   "source": [
    "### 3. Vectorized string functions in pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "bd99fea0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a      asdf@ho.com\n",
       "b    sdwtr@lso.com\n",
       "c     qlop@pad.com\n",
       "d       qpx@ld.com\n",
       "e              NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dic = {\n",
    "    'a':'asdf@ho.com',\n",
    "    'b':'sdwtr@lso.com',\n",
    "    'c':'qlop@pad.com',\n",
    "    'd':'qpx@ld.com',\n",
    "    'e':np.nan\n",
    "}\n",
    "s30 = pd.Series(dic)\n",
    "s30"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "b1dd2802",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    False\n",
       "b    False\n",
       "c    False\n",
       "d    False\n",
       "e     True\n",
       "dtype: bool"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s30.isna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "14b7332a",
   "metadata": {},
   "outputs": [
    {
     "ename": "AttributeError",
     "evalue": "'float' object has no attribute 'split'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[0;32m/tmp/ipykernel_4700/2662890178.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# 获取邮箱名称的部分\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0ms30\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmap\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;32mlambda\u001b[0m \u001b[0mx\u001b[0m \u001b[0;34m:\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'@'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m~/.local/lib/python3.8/site-packages/pandas/core/series.py\u001b[0m in \u001b[0;36mmap\u001b[0;34m(self, arg, na_action)\u001b[0m\n\u001b[1;32m   4159\u001b[0m         \u001b[0mdtype\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mobject\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   4160\u001b[0m         \"\"\"\n\u001b[0;32m-> 4161\u001b[0;31m         \u001b[0mnew_values\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msuper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_map_values\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marg\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mna_action\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mna_action\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   4162\u001b[0m         return self._constructor(new_values, index=self.index).__finalize__(\n\u001b[1;32m   4163\u001b[0m             \u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"map\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/.local/lib/python3.8/site-packages/pandas/core/base.py\u001b[0m in \u001b[0;36m_map_values\u001b[0;34m(self, mapper, na_action)\u001b[0m\n\u001b[1;32m    868\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    869\u001b[0m         \u001b[0;31m# mapper is a function\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 870\u001b[0;31m         \u001b[0mnew_values\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmap_f\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmapper\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    871\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    872\u001b[0m         \u001b[0;32mreturn\u001b[0m \u001b[0mnew_values\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/.local/lib/python3.8/site-packages/pandas/_libs/lib.pyx\u001b[0m in \u001b[0;36mpandas._libs.lib.map_infer\u001b[0;34m()\u001b[0m\n",
      "\u001b[0;32m/tmp/ipykernel_4700/2662890178.py\u001b[0m in \u001b[0;36m<lambda>\u001b[0;34m(x)\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# 获取邮箱名称的部分\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0ms30\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmap\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;32mlambda\u001b[0m \u001b[0mx\u001b[0m \u001b[0;34m:\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'@'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mAttributeError\u001b[0m: 'float' object has no attribute 'split'"
     ]
    }
   ],
   "source": [
    "# Get the parts of each email address\n",
    "# NaN is a float, so calling split on it raises AttributeError\n",
    "s30.map(lambda x :x.split('@'))"
   ]
  },
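  {
   "cell_type": "markdown",
   "id": "c4a91e07",
   "metadata": {},
   "source": [
    "A minimal sketch of one way to keep using `Series.map` when the data contains NaN: the `na_action='ignore'` argument leaves missing values untouched instead of passing them to the function. The `emails` Series below is hypothetical sample data introduced only for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d2f6b13",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical sample data, for illustration only\n",
    "emails = pd.Series(['asdf@ho.com', 'sdwtr@lso.com', np.nan])\n",
    "\n",
    "# na_action='ignore' propagates NaN without calling the lambda on it\n",
    "emails.map(lambda x: x.split('@'), na_action='ignore')"
   ]
  },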
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "3c0d55f9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a      [asdf, ho.com]\n",
       "b    [sdwtr, lso.com]\n",
       "c     [qlop, pad.com]\n",
       "d       [qpx, ld.com]\n",
       "e                 NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Split with the Series.str accessor, which propagates missing values (NaN) instead of raising an error.\n",
    "s30.str.split('@')"
   ]
  },
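  {
   "cell_type": "markdown",
   "id": "e07b5a2c",
   "metadata": {},
   "source": [
    "Splitting returns a list per row. To keep only the part before the `@`, one option is to index into those lists with `.str[0]`, or to pass `expand=True` so each piece becomes its own column. This is a sketch under stated assumptions: the `emails` Series is hypothetical sample data, not the notebook's `s30`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f8d3c46",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical sample data, for illustration only\n",
    "emails = pd.Series(['asdf@ho.com', 'sdwtr@lso.com', np.nan])\n",
    "\n",
    "# .str[0] takes the first element of each list; NaN stays NaN\n",
    "usernames = emails.str.split('@').str[0]\n",
    "\n",
    "# expand=True returns a DataFrame with one column per piece\n",
    "pieces = emails.str.split('@', expand=True)\n",
    "\n",
    "print(usernames)\n",
    "print(pieces)"
   ]
  },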
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "e2ab4aa0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    False\n",
       "b    False\n",
       "c    False\n",
       "d    False\n",
       "e      NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Series.str.contains: check whether each string contains a given substring\n",
    "s30.str.contains('gmail')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "fb980baa",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    [@]\n",
       "b    [@]\n",
       "c    [@]\n",
       "d    [@]\n",
       "e    NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Find all occurrences of a pattern in each string\n",
    "s30.str.findall('@')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "2e3a81d0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    asdf@\n",
       "b    sdwtr\n",
       "c    qlop@\n",
       "d    qpx@l\n",
       "e      NaN\n",
       "dtype: object"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Slice every string in the Series at once\n",
    "s30.str[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9f85b15",
   "metadata": {},
   "source": [
    "### Summary of pandas string methods"
   ]
  },
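  {
   "cell_type": "markdown",
   "id": "7d5ae1f8",
   "metadata": {},
   "source": [
    "pandas string methods are generally used the same way as Python's, but a Python string method acts on a single string, while the pandas methods act on a whole Series of strings at once.\n",
    " * For details on each method, see:\n",
    " * http://www.pypandas.cn/docs/user_guide/text.html#%E6%96%B9%E6%B3%95%E6%80%BB%E8%A7%88"
   ]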
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
